
[contrib] Add MiMo-V2.5-Pro (Xiaomi, 384 experts MoE, FP8 on Trn2)#150

Open
whn09 wants to merge 24 commits into aws-neuron:main from whn09:contrib/MiMo-V2.5-Pro

Conversation


whn09 commented Apr 29, 2026

Description

Adds a contrib port of XiaomiMiMo/MiMo-V2.5-Pro targeting Trn2 with the FP8 blockwise recipe already used by the MiMo-V2-Flash and MiMo-V2.5 contrib ports (moe_tp_degree=1, moe_ep_degree=64). Self-contained under contrib/models/MiMo-V2.5-Pro/; no upstream source is modified.

Model Information

Model Name: MiMo-V2.5-Pro (Xiaomi)

Model Architecture: Decoder-only MoE transformer, 70 layers, 6144 hidden, 128 Q heads / 8 KV heads, asymmetric Q-K head_dim=192 vs V head_dim=128, hybrid attention (10 full + 60 sliding-window layers), attention sink bias on SWA layers, fused qkv_proj, sigmoid router with noaux_tc, 384 routed experts (top-8), no shared expert.

Purpose: Text generation (general-purpose Chinese/English LLM).

Checklist

Required Components

  • Accuracy Test (contrib/models/MiMo-V2.5-Pro/test/integration/test_model.py)
    • Compiles + loads the model on Neuron and asserts the generated tokens are deterministic.
  • README.md with:
    • Usage Example (direct NxDI + vLLM)
    • Compatibility Matrix (Trn2 with SDK 2.29)
    • Example Checkpoints (HF link)
    • Testing Instructions (pytest ...)
  • Source Code (contrib/models/MiMo-V2.5-Pro/src/)
    • modeling_mimo_v2.py (NxDI modeling wrapper)
    • conversion_script/preprocess_mimo_v2_fp8.py (HF OCP FP8 → Neuron FP8 streaming preprocess)

Optional Components

  • Unit tests — not included; covered by the integration test.

Folder Structure

contrib/models/MiMo-V2.5-Pro/
├── README.md
├── perf_test/
│   ├── 0_setup.sh
│   ├── bench_mimo_v2.sh
│   ├── run_bench_single.sh
│   ├── sanity_check.sh
│   ├── smoke_compile_mimo_v2.py
│   ├── smoke_generate_mimo_v2.py
│   └── vllm-neuron-patch.patch
├── src/
│   ├── modeling_mimo_v2.py
│   └── conversion_script/
│       ├── preprocess_mimo_v2_fp8.py
│       ├── preprocess_mimo_v2_parallel.py
│       └── run_preprocess_parallel.sh
└── test/
    ├── __init__.py
    ├── unit/__init__.py
    └── integration/
        ├── __init__.py
        └── test_model.py

Testing

How did you test this change?

Smoke + benchmark runs on a trn2.48xlarge (Neuron SDK 2.29, PyTorch 2.9, Python 3.12):

  1. Preprocess the HF OCP-FP8 checkpoint to Neuron-FP8 (~20 min).
  2. smoke_compile_mimo_v2.py — compile the model (TP=64, moe_tp=1/moe_ep=64, BS=48, seq_len=1024). First compile ~60 min TKG + ~15 min CTE; cached compile ~1 min.
  3. smoke_generate_mimo_v2.py — 20-token generation via HuggingFaceGenerationAdapter.
  4. bench_mimo_v2.sh — vLLM serving via vllm-neuron 0.5.0 + the patch in perf_test/vllm-neuron-patch.patch. Benchmarked with vllm bench serve --dataset-name random --random-input-len 900 --random-output-len 90.

Test Results:

vLLM serving throughput on trn2.48xlarge, FP8, BS=48, TP=64 / moe_tp=1 / moe_ep=64, continuous batching:

Concurrency  Total tok/s  Output tok/s  TTFT median (ms)  TPOT median (ms)
1            47           4.3           1,392             220
16           391          35.6          2,361             422
48           606          55            7,322             752

Per-stream ITL median holds at ~220 ms across all concurrency levels; growth at higher concurrency is from continuous-batching queue pressure.

Integration test: pytest contrib/models/MiMo-V2.5-Pro/test/integration/test_model.py -v — passes locally on the DLAMI venv (requires the preprocessed checkpoint path; see README).
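
For orientation, the shape of the determinism assertion the integration test makes is sketched below. This is not the actual body of test_model.py (which also handles compile and load on Neuron); it is a minimal sketch that assumes an already-loaded model exposing an HF-style generate interface.

```python
# Minimal sketch of the determinism check; `model` / `tokenizer` are assumed to
# come from the NxDI compile/load plumbing in test_model.py, which is not shown.
import torch

def assert_deterministic(model, tokenizer, prompt="Introduce yourself in one sentence."):
    inputs = tokenizer(prompt, return_tensors="pt")
    runs = [model.generate(**inputs, max_new_tokens=20, do_sample=False) for _ in range(2)]
    # Greedy decode on the same NEFF should produce identical token ids across runs.
    assert torch.equal(runs[0], runs[1]), "greedy decode on the same NEFF should be byte-identical"
```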

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29
  • Instance Type(s): trn2.48xlarge (128 physical NeuronCores, logical_nc_config=2 → 64 logical cores)
  • PyTorch Version: 2.9.1
  • Python Version: 3.12

Additional Information

  • HBM footprint: Pro's FP8 expert weights are ~500 GB across 64 ranks; per-rank tensors reach ~20 GB on Trn2's 24 GB HBM, so the recipe is sensitive to collective DMA ring headroom. Any recipe change that adds per-rank tensor size (e.g. BF16 attention weights) can OOM at load.
  • BS constraint: NxDI's TKG path refuses Expert Parallelism when batch_size < num_experts / top_k. For Pro that is 384 / 8 = 48, so the smallest working BS on the FP8 path is 48.
  • FP8 numerical behaviour: under the all-FP8 recipe, Pro's attention weights (abs_mean ≈ 0.00124, 4× smaller than V2.5) cause the NKI blockwise FP8 accumulator to drift over 70 layers, producing prompt-dependent gibberish. A BF16-attn recipe (MoE kept FP8, q/k/v dequanted to BF16 via src/conversion_script/repatch_qkv_bf16.py, compiled at seq_len=256 to fit HBM) restores coherent output — verified end-to-end on smoke_generate_mimo_v2.py. This narrows the root cause to the attention path, not the MoE experts. The vLLM scripts in this PR still use the all-FP8 recipe so the bench numbers are from that configuration; re-benchmarking on BF16-attn is queued. See the README Status section for the full write-up.
  • Preprocess: src/conversion_script/preprocess_mimo_v2_fp8.py streams layer-by-layer (~24 GB peak RAM, ~20 min) and writes <save_path>/model_layer{N}.safetensors plus model_extras.safetensors. A parallel variant is provided in preprocess_mimo_v2_parallel.py.
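
For readers unfamiliar with the streaming layout, a rough sketch of the layer-by-layer write pattern from the Preprocess bullet above follows. Only the safetensors API and the model_layer{N}.safetensors output naming come from this PR; the shard scan and the convert_layer transform are placeholder assumptions, not the script's actual logic.

```python
# Rough sketch of layer-by-layer streaming preprocess; `convert_layer` and the
# tensor-name filtering are placeholders, not preprocess_mimo_v2_fp8.py itself.
import os
from safetensors import safe_open
from safetensors.torch import save_file

def preprocess(hf_shard_paths, save_path, num_layers=70):
    os.makedirs(save_path, exist_ok=True)
    for layer in range(num_layers):
        prefix = f"model.layers.{layer}."
        tensors = {}
        for shard in hf_shard_paths:
            with safe_open(shard, framework="pt") as f:
                for name in f.keys():
                    if name.startswith(prefix):
                        tensors[name] = f.get_tensor(name)
        out = convert_layer(tensors)  # hypothetical: FP8 repack + BF16 q/k/v dequant
        save_file(out, os.path.join(save_path, f"model_layer{layer}.safetensors"))
        # Only one layer's tensors are resident at a time, which keeps peak RAM bounded.
```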

Related Issues

  • Related contrib ports in the same family: #137 MiMo-V2-Flash, #148 MiMo-V2.5. Pro reuses the same modeling_mimo_v2.py wrapper and preprocess pipeline; the contrib folder is self-contained to match the per-model layout of the MiMo series.

vLLM Integration

  • This model is intended for use with vLLM.
  • Documentation includes vLLM registration instructions.

The perf_test/vllm-neuron-patch.patch adds a _register_contrib_models() hook to vllm-neuron 0.5.0's neuronx_distributed_model_loader.py that:

  • Registers NeuronMiMoV2ForCausalLM into NxDI's MODEL_TYPES under mimov2flash and mimov2pro.
  • Overrides vLLM's built-in GPU stubs for MiMoV2FlashForCausalLM / MiMoV2ProForCausalLM in ModelRegistry (they otherwise block ModelConfig validation).
  • Patches AutoConfig.from_pretrained to default trust_remote_code=True so NxDI's hf_adapter.load_config can load the custom MiMoV2Config that ships with the checkpoint.

No upstream vllm-neuron code is modified — the patch lives in the contrib folder and is applied at install time by perf_test/0_setup.sh.
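
For orientation, a hedged sketch of the shape of such a hook is below. Only ModelRegistry.register_model and AutoConfig.from_pretrained are standard vLLM / transformers APIs; MODEL_TYPES and NeuronMiMoV2ForCausalLM are passed in as parameters because their import paths inside vllm-neuron / the contrib folder are not reproduced here.

```python
# Hedged sketch of what a _register_contrib_models() hook could look like; this is
# not the patch itself, and the MODEL_TYPES / model-class wiring is an assumption.
import functools
from transformers import AutoConfig
from vllm import ModelRegistry

def register_contrib_models(MODEL_TYPES, NeuronMiMoV2ForCausalLM):
    # 1. Map the HF model_type strings onto the Neuron implementation in NxDI's registry.
    MODEL_TYPES["mimov2flash"] = NeuronMiMoV2ForCausalLM
    MODEL_TYPES["mimov2pro"] = NeuronMiMoV2ForCausalLM

    # 2. Override vLLM's built-in GPU stubs so ModelConfig validation accepts the arch.
    for arch in ("MiMoV2FlashForCausalLM", "MiMoV2ProForCausalLM"):
        ModelRegistry.register_model(arch, NeuronMiMoV2ForCausalLM)

    # 3. Default trust_remote_code=True so hf_adapter.load_config can load the
    #    custom MiMoV2Config shipped with the checkpoint.
    orig = AutoConfig.from_pretrained.__func__

    @functools.wraps(orig)
    def patched(cls, *args, **kwargs):
        kwargs.setdefault("trust_remote_code", True)
        return orig(cls, *args, **kwargs)

    AutoConfig.from_pretrained = classmethod(patched)
```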


By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

🤖 Generated with Claude Code

whn09 and others added 23 commits April 28, 2026 05:52
…named)

Bootstrap contrib entry for XiaomiMiMo/MiMo-V2.5-Pro on Trn2 via NxDI.

Same starting point as the MiMo-V2-Pro port:
  src/modeling_mimo_v2.py              (was modeling_mimo_v2_pro.py)
  src/conversion_script/preprocess_mimo_v2_fp8.py
  perf_test/{smoke_compile,smoke_generate,bench}_mimo_v2.{py,sh}

Rename-only changes in this commit:
  MiMoV2Pro* identifiers -> MiMoV2*  (classes, configs, modules)
  mimo_v2_pro paths      -> mimo_v2 / mimo_v25_pro (compile dirs)
  HF repo XiaomiMiMo/MiMo-V2-Pro -> XiaomiMiMo/MiMo-V2.5-Pro
  README architecture table updated to V2.5-Pro config
    (70 layers, 6144 hidden, 128 heads, 384 experts, etc.)
  README disk footprint updated to match V2.5-Pro actual size (~962GB HF)

Not yet adapted to V2.5-specific differences — these still need work:
  - attention_chunk_size=128 (new in V2.5, not handled in V2-Pro code)
  - MoE group-limited noaux_tc (n_group, topk_group) — V2.5 config sets 1,1
    so it degenerates to plain noaux_tc; the Pro monkey-patch already matches
  - FP8 recipe verification on V2.5 weights (V2-Pro workarounds may or may
    not apply: mean-subtract router bias, split_qkv_fused interleaved layout,
    blockwise scale stride fix)

Subsequent commits will adapt each of the above after validation on Trn2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
P0 fix:
- MiMoV2InferenceConfig now stashes `dense_intermediate_size` from HF's
  `intermediate_size` BEFORE overwriting `self.intermediate_size` with the
  MoE value. MiMoV2MLP reads this explicit field instead of the brittle
  `config.intermediate_size * 8` fallback (which happened to equal 16384
  for V2.5-Pro by coincidence).
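
A minimal sketch of the ordering fix, assuming an HF config that exposes a separate MoE width field (the moe_intermediate_size name here is an assumption, not taken from the checkpoint):

```python
# Sketch only: stash the dense-MLP width before intermediate_size is overwritten.
class MiMoV2InferenceConfig:
    def __init__(self, hf_config):
        # While intermediate_size still holds the HF dense value, stash it explicitly...
        self.dense_intermediate_size = hf_config.intermediate_size
        # ...and only then overwrite it with the per-expert MoE width.
        self.intermediate_size = hf_config.moe_intermediate_size  # field name is an assumption
```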

P3 — stale V2-Pro / Flash comments updated:
- attention_value_scale comments: "(0.707 for Flash)" → "(0.612 for V2.5-Pro)"
- convert_mimo_v2_hf_to_neuron_state_dict kv heads comments: V2.5-Pro has
  num_key_value_heads=8 (same as SWA), not 4 as in V2-Pro.
- smoke_compile docstring reworded to drop "Flash BS=1 recipe" wording.
- smoke_compile default recipe changed to moe_tp=1/moe_ep=64/BS=48
  (per user request: first V2.5-Pro test uses this recipe because it
  compiles fastest; bug surface on V2-Pro under this recipe was FP8
  precision loss in expert MLP weights, which may not reproduce on V2.5).
- preprocess router bias comment: noted measured mean=70.906 std=2.4e-4
  (identical pathology to V2-Pro, mean-subtract still required).

No behavioral change to FP8 monkey-patches or qkv interleaved-group split
logic — HF reference diff confirmed V2.5-Pro ships the same interleaved
`[16Q|1K|1V]*8` FP8 qkv layout and the same noaux_tc routing (n_group=1,
topk_group=1 degenerate to plain noaux_tc).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Record what works and what doesn't on 2026-04-28:
- Compile + load succeed on Trn2 (moe_tp=1/ep=64/BS=48 recipe).
- Prefill produces coherent English but off-topic output ("100% of the
  time..." loop for an "explain transformer" prompt). Same signature as
  V2-Pro's earlier FP8 failures — per-expert weight distribution too
  narrow for FP8 e4m3 precision.
- Note: the observed token IDs 15/16/4/315/279/882 look suspiciously small
  but are just " of/ the/ time" etc. — top-frequency English BPE subwords.
  Greedy decode is correct; the logit distribution itself is wrong.
- List recipes still to try (moe_tp=16/ep=4, moe_tp=32/ep=2 etc.) and
  NxDI constraints that rule out BS=1 when moe_ep>1.

Points future debuggers at Jim Burtoft's Flash FP8 observation and the
SDK 2.28 recommendation from his Kimi PR aws-neuron#131.

No code changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Earlier wording said "Pro's expert weight std is too small for FP8
precision" in absolute terms. That's misleading — sglang on H100/H200
runs the exact same OCP FP8 checkpoint and produces correct output,
because GPU cutlass/sglang paths dequantize FP8 to BF16 before the
matmul.

The actual issue appears to be Neuron's NKI blockwise FP8 compute
kernel (_bwmm_shard_on_block_nki_call) running FP8 compute directly
on subnormal-leaning tensors. Jim Burtoft's Kimi PR aws-neuron#131 names the
Neuron SDK 2.29 blockwise kernel as producing "depressed logits with
EP=2" and recommends SDK 2.28.

Also noted: V2.5-Pro MoE expert weights are byte-identical to V2-Pro
(measured layer 1 expert 0 gate_proj stats match to 6 decimals), so
all V2-Pro FP8 workarounds remain required — not a new bug.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Parallel preprocess wrapper:
- preprocess_mimo_v2_parallel.py: multiprocessing Pool wrapper around
  preprocess_mimo_v2_fp8.process_layer. Each worker opens its own
  LazyWeightMap and processes one layer at a time. N_WORKERS default
  raised to 12 (user request: "the more the better"); 12 workers * ~25 GB
  peak per layer stays under ~300 GB RAM on a 2 TB trn2.48xl.
- run_preprocess_parallel.sh: thin shell wrapper exposing HF_MODEL_PATH,
  SAVE_PATH, TP_DEGREE, N_WORKERS env vars. Defaults to the 2_9_nxd_inference
  venv (same one used by the serial preprocess).

Wall-clock ~30 min serial → ~5-6 min at 12 workers on fresh cache.
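
A rough sketch of the Pool wrapper described above; LazyWeightMap and process_layer are the names used in this commit, but their signatures here are simplified assumptions:

```python
# Sketch of the parallel wrapper; exact process_layer / LazyWeightMap signatures
# are assumptions, only the one-layer-per-worker structure follows this commit.
import os
from multiprocessing import Pool

from preprocess_mimo_v2_fp8 import LazyWeightMap, process_layer  # contrib module

N_WORKERS = int(os.environ.get("N_WORKERS", 12))
NUM_LAYERS = 70

def _worker(layer_idx):
    # Each worker opens its own LazyWeightMap so no large state is pickled across processes.
    weights = LazyWeightMap(os.environ["HF_MODEL_PATH"])
    process_layer(weights, layer_idx, os.environ["SAVE_PATH"])
    return layer_idx

if __name__ == "__main__":
    with Pool(N_WORKERS) as pool:
        for done in pool.imap_unordered(_worker, range(NUM_LAYERS)):
            print(f"layer {done} done")
```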

README:
- Added "NVMe mount" subsection under Prerequisites. trn2.48xl DLAMI
  assembles four NVMe into RAID0 at /opt/dlami/nvme but does NOT
  remount automatically after a reboot. Document mdadm --assemble
  + mount /dev/md0 /opt/dlami/nvme before any path in the recipes resolves.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ipts

AWS trn2-llama3.1-405b-speculative FP8 tutorial ("Scenario 2, Step 2")
requires XLA_HANDLE_SPECIAL_SCALAR=1 and UNSAFE_FP8FNCAST=1 for OCP-sourced
FP8 checkpoints on Neuron. Setting them in both smoke_compile_mimo_v2.py
and smoke_generate_mimo_v2.py via os.environ.setdefault (user-level env
overrides still win).

Note: our preprocess output has 0 bytes in the IEEE-NaN-adjacent range
(byte exp=0b1111), verified on layers.1 attn q/k/v and MoE gate_up/down
in /opt/dlami/nvme/models/MiMo-V2.5-Pro-Neuron-FP8. So these flags are
theoretically optional for our pipeline, but they match the exact surface
of AWS's reference FP8 tutorial — cheap safety.

Also corrected stale docstrings: smoke_compile now says the NxDI venv
(pytorch_2_9_nxd_inference) is the target, not the vllm venv.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…r-only LM

HuggingFace tokenizer defaults to padding_side='right', which silently
corrupts batched prefill on a causal LM: the last token of each slot
becomes a pad token, and the logit used for generating the next token
is predicting "what comes after the pad", not "what comes after the
real prompt".

Observed when running a 6-prompt probe at BS=48: prompts that nearly
fill the 267-token batch dimension produced garbage output like "all
spaces" (token 220) or random short-id BPE noise. Fix: explicitly set
padding_side='left' after tokenizer load.
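
The fix itself is two lines of standard tokenizer usage; everything around it below (paths, prompts) is placeholder:

```python
# Left-pad so the last position of every batch slot is a real prompt token.
from transformers import AutoTokenizer

model_path = "/path/to/MiMo-V2.5-Pro"  # placeholder
prompts = ["Explain transformers.", "What is 1+1?"]  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
tokenizer.padding_side = "left"  # default 'right' corrupts batched prefill on a causal LM
batch = tokenizer(prompts, return_tensors="pt", padding=True)
```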

Single-prompt smoke (all slots == same prompt, so no padding triggered)
was not affected by this bug, but was producing wrong output for a
different reason (the underlying FP8 expert-MLP precision issue).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
V2.5-Pro's attention Q/K/V weights have dequantized abs_mean ~0.001-0.005,
roughly 4x smaller than V2.5 (which works). Preprocess has been patched
to rewrite the q/k/v_proj tensors in the preprocessed checkpoint as BF16
(matching how o_proj is already handled). Add q_proj/k_proj/v_proj to
modules_to_not_convert so NxDI does not try to swap them to
QuantizedColumnParallel at convert() time — they remain plain
ColumnParallelLinear with BF16 weights.

MoE expert weights (gate_up_proj, down_proj) stay FP8 blockwise;
their weights saturate the full FP8 ±240 range so quantization is
lossless there. Only the attention path goes BF16.
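
For context, per-block FP8-to-BF16 dequantization of this kind typically looks like the sketch below; the 128x128 block size and the weight_scale_inv multiply-convention follow the common HF OCP-FP8 layout and are assumptions here, not the preprocess script itself:

```python
# Hedged sketch of blockwise FP8 -> BF16 dequant (not the actual preprocess code).
import torch

def dequant_blockwise_fp8(w_fp8: torch.Tensor, scale_inv: torch.Tensor, block: int = 128) -> torch.Tensor:
    # Expand per-block scales back to elementwise, multiply in FP32, then cast to BF16.
    scale = scale_inv.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    scale = scale[: w_fp8.shape[0], : w_fp8.shape[1]]  # trim if dims are not block multiples
    return (w_fp8.to(torch.float32) * scale).to(torch.bfloat16)
```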

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous attempt (2a4c9ff) rewrote q/k/v_proj as BF16 to work around
Pro's attention weight precision (q_proj abs_mean ~0.00124, 4x smaller
than V2.5). Compile succeeded, but load failed with HBM OOM: the BF16
attention weights added ~2 GB per rank, pushing Tensors to 20.93 GB on
a 24 GB Neuron HBM and leaving no room for collective DMA rings.

Back off on the BF16-attn approach and try a different hypothesis:
the NKI blockwise matmul kernel has accumulator precision issues on
Pro's MoE expert weights (scale_mean ~5e-5 vs 2.5e-4 on V2.5). Switch
blockwise_matmul_config from use_shard_on_block_dynamic_while to
use_torch_block_wise=True, which uses a PyTorch fallback that
dequantizes each block to BF16 before matmul. Slower but more precise
in the accumulator. q/k/v_proj return to FP8 (back out of
modules_to_not_convert) so the attention weights don't blow HBM.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pro is now serveable via vllm-neuron 0.5.0 on Trn2 (TP=64, moe_ep=64,
BS=48). Output quality under the FP8 recipe is still prompt-dependent
(drift on most prompts, coherent on self-intro style), consistent with
Pro's 4-7x smaller MoE FP8 scales compared to V2.5 and the V2-Pro
symptom.

Changes:
- Revert blockwise_matmul_config back to use_shard_on_block_dynamic_while
  + PING_PONG (Flash/Kimi recipe). The use_torch_block_wise + BF16-attn
  experiments both OOM on load.
- Fix bench_mimo_v2.sh / smoke configs from BS=32 (Flash) to BS=48
  (Pro: 384/8=48), plus all accompanying text in the README.
- vLLM patch now registers both MiMoV2FlashForCausalLM and
  MiMoV2ProForCausalLM in vLLM's ModelRegistry, overriding the built-in
  GPU stubs; patch works against vllm-neuron release-0.5.0.
- Point sanity_check.sh, run_bench_single.sh, 0_setup.sh defaults at
  the Neuron-FP8 checkpoint (not BF16).
- Record measured vLLM serving throughput at c=1/16/48 in the README
  Performance section (replaces stale BF16 numbers).
- Rewrite the Status section: document the drift pattern with prompt
  examples, the recipes that were tried and failed (BF16-attn, torch
  blockwise), and the two-node BF16 experiment queued next.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
trn2.48xlarge has 16 Trainium2 chips x 8 cores = 128 physical
NeuronCores. logical_nc_config=2 halves that to 64 logical cores,
which matches tp_degree=64. Previous Prerequisites line said "32
NeuronCores" which is wrong.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…trio

Mirror the V2.5 structure so Pro has:
- start_vllm_server.sh (new): foreground launcher baking in the full
  override_neuron_config, persistent NEURON_COMPILED_ARTIFACTS path,
  and all env-var plumbing. Stays up for ad-hoc curl/sanity.
- bench_mimo_v2.sh: rewritten as a one-shot composer (start_vllm_server
  in background + wait + sanity + run_bench_single at c=1/16/48).
  Replaces the old inline-launch-with-full-JSON version (~110 lines
  shorter).
- run_bench_single.sh: default CONFIG_NAME/RESULTS_DIR brought in line
  with bench_mimo_v2.sh and the V2.5 port.

README:
- Add "Keeping a server up for ad-hoc testing" section and an
  Environment variables table (NXDI_CONTRIB_MIMO_V2_FLASH_SRC,
  NEURON_COMPILED_ARTIFACTS, BASE_COMPILE_WORK_DIR, etc.).
- Replace the ~60-line inline vllm api_server invocation with pointers
  to start_vllm_server.sh / bench_mimo_v2.sh; the README no longer
  duplicates the config that lives in the scripts.
- Fix "downloads Flash weights" text in the 0_setup.sh blurb (now
  downloads Pro Neuron-FP8 weights).
- Bench results dir default moved to
  /opt/dlami/nvme/logs/bench_results/mimo_v2_5_pro/ to align with V2.5.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
"What is 1+1?" drifts to unrelated text under the current FP8 recipe.
"Introduce yourself in one sentence." is a high-signal self-identifying
prompt that still answers coherently (e.g. "I'm MiMo, developed by
Xiaomi LLM Core Team.") and gives a sensible first-run demo.

Also drop the explicit `temperature: 0.0` from the request body: vllm-neuron
honours the compile-time on_device_sampling_config, not the request-side
temperature, so sanity output is always sampled at T=0.6. Note this in a
comment.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root cause of the FP8 drift is narrowed to the attention path, not the
MoE experts. Pro's q/k/v weights have abs_mean ~0.00124, 4x smaller
than V2.5 (256 experts), and the NKI blockwise FP8 accumulator loses
enough precision at this magnitude to drift the logits across 70
layers. Dequantizing q/k/v to BF16 while keeping MoE experts FP8
restores coherent output on smoke_generate, e.g.:

  <think>Okay, the user is asking for a simple self-introduction
  in one sentence, with no deeper or hidden needs apparent. As MiMo,
  based on Xiaomi's self-developed large model, I need to respond in
  a friendly, positive, and helpful way that aligns with providing
  assistance ...

Changes:
- Add src/conversion_script/repatch_qkv_bf16.py (promoted from
  /opt/dlami/nvme/scripts/), now argparse-driven. Reads HF fused
  qkv_proj + weight_scale_inv, dequants per kv-head group, writes
  BF16 q/k/v into the preprocessed Neuron-FP8 checkpoint in place,
  drops scale entries from the safetensors index. ~22 min runtime.
- smoke_compile_mimo_v2.py / smoke_generate_mimo_v2.py: add
  q_proj/k_proj/v_proj to modules_to_not_convert, drop seq_len from
  1024 to 256 (BF16 q/k/v adds ~2 GB per rank; seq_len=1024 OOMed on
  load last time), switch default COMPILED_PATH to the new BF16-attn
  directory name to avoid clobbering earlier artifacts.
- README: rewrite Status to separate the all-FP8 result (drifted)
  from the BF16-attn result (coherent); document the required repatch
  step, the HBM / seq_len trade-off, and a warning that listing q/k/v
  in modules_to_not_convert without running repatch first produces
  nonsense (NxDI casts fp8 bytes to bf16 without applying the scale).
  Update Quick Start to include the repatch step. Flag that vLLM
  scripts still use the all-FP8 recipe and the bench numbers haven't
  been re-measured on BF16-attn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The separate repatch_qkv_bf16.py was a diagnostic workaround: preprocess
FP8 first, discover drift, then retro-fit BF16. Now that BF16 attn is
the confirmed recipe, fold the per-group dequant directly into the
preprocess so there is one script, one output, no "forgot to run
repatch" trap.

Changes:
- preprocess_mimo_v2_fp8.py::split_qkv_fused now returns BF16 per-proj
  tensors directly (Dict[str, Tensor] instead of Dict[str, Tuple[...]]).
  The FP8+blockwise path still unwinds the phantom-row padding, then
  dequants to BF16 in one go. BF16-source path collapses to the same
  reshape without requant.
- Add _dequant_attn_to_bf16() for the Flash-style non-fused q/k/v
  fallback path; process_layer calls it so those projections also come
  out BF16.
- No compile-time flag or branch for "all-FP8 attn" — that recipe is
  known broken for Pro (produces gibberish), preserving the branch only
  invites re-discovering the same trap.
- Delete src/conversion_script/repatch_qkv_bf16.py.
- README: drop the "Required follow-up: repatch" subsection, simplify
  the Status writeup (one recipe, one outcome), remove step 3b from
  Quick Start, clarify in "Preprocess emits BF16 q/k/v" that
  modules_to_not_convert still needs q/k/v so NxDI routes them through
  the non-quantized ColumnParallelLinear.
- smoke_compile_mimo_v2.py: tighten the inline comment on q/k/v in
  modules_to_not_convert (no more "Prerequisite: run repatch").

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously preprocess / smoke / vLLM-serving used three different venvs
depending on which stage of the port we were in; both 2_9_nxd and
inference_vllm_0_16 happen to have working NxDI + torch installs, so
everything ran but the split was noise. Pick one and stick with it.

pytorch_inference_vllm_0_16 is the right choice because:
- 0_setup.sh installs vllm-neuron (editable) there, so vllm serving
  has no alternative.
- NxDI direct calls from smoke_compile / smoke_generate also work
  there (nxdi is preinstalled by the DLAMI in both venvs).
- Keeping one venv means no confusion about which python to invoke.

Files updated: 0_setup.sh, run_bench_single.sh, smoke_compile_mimo_v2.py
and smoke_generate_mimo_v2.py docstrings, run_preprocess_parallel.sh,
README Prerequisites.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Status had a 3-way split (all-FP8 vs BF16-attn vs preprocess emits
BF16) that made sense during the diagnosis but doesn't once BF16-attn
is the only shipping recipe. Collapse it into four focused subsections:
  * Why BF16 attn + FP8 MoE
  * Cost and constraints (HBM, seq_len=256, BS>=48, EP constraints)
  * Recipes tried that did not work (all-FP8, use_torch_block_wise)
  * Next experiments queued

Performance: reframe the vLLM throughput table as a historical
all-FP8 capture kept for infra validation and order-of-magnitude
reference. The shipping recipe (BF16 attn + seq_len=256) hasn't been
re-benchmarked yet; note the expected delta (only q/k/v change, MoE
unchanged) so readers can project.

vLLM Serving note: since the shipped start_vllm_server.sh still has
seq_len=1024 and doesn't list q/k/v in modules_to_not_convert, spell
out exactly what to change if the BF16-attn checkpoint OOMs on load.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The preprocess now emits BF16 q/k/v (no .scale entries), so vllm-neuron
must route attention through the non-quantized ColumnParallelLinear.
Three required changes:

- Add q_proj/k_proj/v_proj to modules_to_not_convert. Without this,
  NxDI tries to load q_proj.scale and bails with
  "Cannot find layers.0.self_attn.q_proj.scale in state_dict".
- Drop seq_len / max_model_len / context_encoding_buckets /
  token_generation_buckets from 1024 to 256. BF16 q/k/v adds ~2 GB per
  rank and seq_len=1024 OOMs on load; seq_len=256 is the smoke-verified
  upper bound.
- Move NEURON_COMPILED_ARTIFACTS default to a new path
  (mimo_v2_5_pro_bs48_moetp1_ep64_bf16attn_seq256_vllm) so it doesn't
  collide with the old all-FP8 compile dir that's been S3-backed up.

Note for longer context: seq_len is the single biggest HBM constraint
on this recipe; raising it will require either a smaller batch, a
different EP ratio, or cross-instance sharding (see README "Next
experiments queued").

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The start_vllm_server.sh now compiles with seq_len=256 (BF16-attn HBM
constraint). Pro's default chat template prepends a ~240-token system
prompt that by itself busts the bucket, and the old bench default
(input 900, output 90) is also way over.

sanity_check.sh:
- Switch from /v1/chat/completions to /v1/completions with a
  hand-rolled <|im_start|>user... <|im_end|><|im_start|>assistant
  frame that tokenises to ~17 tokens.
- Do the HTTP POST from python (bash heredoc mangles the \n inside
  the chat template, which used to make the model emit a garbage
  first token — UTF-8 replacement char "�" at the start of every reply).
- Note in-comment that request-side temperature / top_k / top_p are
  ignored; the NEFF's on_device_sampling_config wins.

run_bench_single.sh:
- Default INPUT_LEN 900 -> 180, OUTPUT_LEN 90 -> 60 (180+60 = 240,
  fits under seq_len=256 with a small margin for random-range-ratio).
- Comment explains the seq_len=256 constraint.

bench_mimo_v2.sh is unchanged; it delegates length knobs to
run_bench_single.sh.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Switch back to /v1/chat/completions with an explicit short system
message ("You are MiMo, a helpful assistant..."). apply_chat_template
then uses our system turn instead of Pro's ~240-token default, and
the prompt comes out to ~25 tokens — well under seq_len=256.

This is simpler than the /v1/completions + manually-framed-chat
route (no shell-escape \n landmines, native OpenAI API shape) and
composes cleanly with other chat clients that assume /v1/chat.

Override via SYSTEM=... / PROMPT=... / MAX_TOKENS=... env vars.
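
For reference, a Python equivalent of the sanity request (the shipped script is shell); the port and served-model name are placeholders, and request-side sampling parameters are omitted since the NEFF's on-device sampling config wins anyway:

```python
# Minimal sketch of the /v1/chat/completions sanity request; port/model name are placeholders.
import os
import requests

payload = {
    "model": "MiMo-V2.5-Pro",  # placeholder served-model name
    "messages": [
        {"role": "system", "content": os.environ.get("SYSTEM", "You are MiMo, a helpful assistant.")},
        {"role": "user", "content": os.environ.get("PROMPT", "Introduce yourself in one sentence.")},
    ],
    "max_tokens": int(os.environ.get("MAX_TOKENS", 64)),
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=600)
print(resp.json()["choices"][0]["message"]["content"])
```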

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
vllm-neuron's own compile path — -O3, --enable-internal-neff-wrapper,
on_device_sampling baked into the NEFF, continuous batching — produces
garbled first-decode output on Pro: every reply starts with a UTF-8
replacement char and then coherent but completely off-topic text. V2.5
under the same vllm-neuron compile path works fine, so the trigger is
Pro-specific (likely SWA + attention sink bias interacting with one of
the compile / runtime options above, root cause not isolated).

The NxDI-smoke compile path (-O1, no on-device sampler, static batch,
produced by perf_test/smoke_compile_mimo_v2.py) does not hit the
problem. vllm-neuron can load that NEFF at runtime and serves
coherent chat completions with proper `<think>` traces.

As a workaround, default NEURON_COMPILED_ARTIFACTS to the smoke
compile dir. Users can still override the env var to point at a
vllm-neuron-compiled NEFF for testing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…LM bug

seq_len=512 under the BF16-attn recipe was verified end-to-end (compile
+ shard + load + 5x deterministic greedy generate) via smoke. HBM fits;
seq_len=1024 still OOMs.

Also documents the vllm-neuron "first request coherent, subsequent
requests garbled" bug (tracked upstream at vllm-project/vllm-neuron#31).
Every configuration knob we tried (all-FP8 attn, BF16 attn at 256 or 512,
CB on/off, on-device sampling on/off, -O3 -> -O1) reproduced the same
symptom on Pro but not on V2.5; the same NEFF serves 5 successive
greedy generates byte-identically under smoke_generate_mimo_v2.py, so
the bug is in vllm-neuron's runtime, not the NEFF.

README changes:
- Status opener now says the smoke path is verified and the vLLM
  serving path is blocked on issue aws-neuron#31.
- Bump seq_len=256 references to seq_len=512 in HBM/constraints,
  Usage example, and the MoENeuronConfig code block.
- Rewrite the vLLM "Note" callout to point at issue aws-neuron#31 as the
  single source of truth for the broken state, drop the obsolete
  "drop to 256" recovery hints.

Script changes:
- smoke_compile_mimo_v2.py: SEQ_LEN default 256 -> 512; COMPILED_PATH
  suffix seq256 -> seq512. Comment rewritten.
- smoke_generate_mimo_v2.py: matching SEQ_LEN and COMPILED_PATH
  default changes so a bare `python smoke_generate_mimo_v2.py`
  picks up the seq_len=512 NEFF.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>